Introduction

This project focuses on replicating aspects of the paper “Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq” by Tirosh et al. The biological question of interest is the exploration of the distinct genotypic and phenotypic states of melanoma tumors. There is a need for a deeper understanding of melanoma composition and its effect on the clinical course as the cellular composition of each tumor may exert critical roles in cancer development. Furthermore, classification of tumor heterogeneity may inform treatment through immunological or other mechanisms. This is particularly important in tumors with few treatment options, or in cases where tumors are likely to recur.

The paper “used a multistep approach to distinguish the different cell types within melanoma tumors on the basis of both genetic and transcriptional states,” which we attempted to replicate. For their first step and one of this project’s tasks, large-scale copy number variations (CNVs) were inferred from expression profiles by averaging expression over stretches of 100 genes on their respective chromosomes. For the next task, the cells were grouped according to their expression profiles using t-SNE (nonlinear dimensionality reduction). In addition to t-SNE, we attempted dimensionality reduction using UMAP and clustering with SIMLR. The paper found that in general, “cells designated as malignant by CNV analysis formed a separate cluster for each tumor, suggesting a high degree of intertumor heterogeneity.” The non-malignant cells, however, clustered by cell type and not by tumor or metastatic site.

The Data

We utilized the publicly available dataset GSE72056 containing 4645 single-cell RNA-seq profiles measured from malignant, immune, and stromal cells, isolated from 19 patients that span a range of clinical and therapeutic backgrounds. For the malignant cells, only tumors with >50 malignant cells were included. The paper included 6 tumors on this basis, but we found 8 tumors satisfying the condition, depending on ordering of filtering criteria. For the non-malignant cells, only tumors with >100 non-malignant cells were included. Using this criteria, the paper included 11 tumors in the main figure while we included 12.

Before beginning the analysis, we used the Seurat package to perform quality checks on the data. The ‘SCTransform’ function scaled the residuals to have unit variance and centered the residuals to have mean zero. However, the data was already pre-processed and using centered and scaled data resulted in worse performance compared to using just centered data. ‘SCTransform’ was also used to find the top 300 and 3000 variable genes, used to subset the data is later analysis. As will be seen later, using data containing only the 300 most variable genes improved analysis compared to using only the 3000 most variable genes and also compared to using the complete data.

(insert more about data here)

Task 1

For the first task, we attempted to replicate the following plot using InferCNV.

Copy number variations (CNVs) are genomic alterations, with abnormal deletions or amplifications of one or more gens on segments of chromosomes. And InferCNV is applied to explore the tumor single cell RNA-seq data, detect chromosomal copy number alterations and identify deletions or amplifications of malignant cell gene expression by comparing it to a set of non-malignant cells.

A heatmap is generated directly from the InferCNV, illustrating the relative expression intensities across each chromosome. From the plot, it’s apparent to see which regions of the tumor genome are over-abundant or less-abundant as compared to that of non-malignant cells.

Advanced application of InferCNV also includes methods to predict CNA regions and define cell subclusters according to patterns of heterogeneity, and you can run it step by step for more exploratory purposes, which is not included in this report. Here we just use the standard pipeline to replicate the figure in paper and compare the results.

Use non-malignant cells as reference cells for Mel80:

If we do not indicate reference cells, the average signal across all cells will be used to define the baseline:

We also try to plot heatmap for Mel78 as showed in supplementary materials:

Plot caption: Chromosomal landscape of inferred large-scale CNVs allows us to distingush malignant from non-malignant cells. The Mel80 tumor is shown with individual cells (y axis) and chromosomal regions (x axis). Amplifications (red) or deletions (blue) were inferred by averaging expression over 100-gene stretches on the respective chromosomes. Inferred CNVs are concordant with calls from WES.

Task 2

For the second task, we attempted to replicate the following plots using various visualization and clustering methods. Malignant and non-malignant cell-types were distinguished using single-cell expression profiles. Clusters of non-malignant cells are marking by dashed ellipses and were annotated as T cells, B cells, macrophages, endothelial cells, cancer-associated fibroblasts, and natural killer cells.

t-SNE

(description of t-SNE and explain results ) (insert Eric’s t-SNE plots)

T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique used for visualizing high-dimensional data such as a single-cell gene expression data into low dimensional space. t-SNE allows us to visualize the relationships between cells based on their ensemble gene expression profiles.

Running t-SNE

We applied t-SNE to the single-cell RNA-seq data from Tirosh et al., separating malignant from non-malignant cells as determined from inferCNV. First, we loaded the data into R, separating the phenotypic and feature data from the single-cell expression matrix:

## Read in data, separate into phenotypic and feature data tables ####
dat <- fread("data/GSE72056_melanoma_single_cell_revised_v2.txt")
edat <- as.matrix(dat[4:nrow(dat), 2:ncol(dat)])
fdat <- data.table(genes=unlist(dat[4:nrow(dat),1]))
fdat$genes[17008] <- paste0(fdat$genes[17008], "_1")
fdat$genes[23202] <- paste0(fdat$genes[23202], "_2")
fdat$genes[8027] <- paste0(fdat$genes[8027], "_1")
fdat$genes[23637] <- paste0(fdat$genes[23637], "_2")
pdat <- data.table(
  cellID=colnames(dat)[2:length(colnames(dat))],
  tumor=unlist(dat[1,2:ncol(dat)]),
  mal=unlist(dat[2,2:ncol(dat)]), #malignant(1=no, 2=yes, 0=unresolved)
  cell_type=unlist(dat[3,2:ncol(dat)]) #non-malignant cell type (1=T, 2=B, 3=Macro., 4=Endo., 5=CAF; 6=NK)
)
rownames(edat) <- fdat$genes

To reduce the complexity of the t-SNE visualization, we restricted the number of tumors used in the analysis according to Tirosh et al. For the non-malignant cell analysis, tumors with at least 100 cells were used:

## Tumors with at least 100 cells
filteredTumors1 <- pdat[,.N, by = tumor] %>%
  filter(., N >= 100) %>%
  select(., tumor) %>%
  pull(., tumor)

Tumors were filtered a second time to include only those with more than 50 malignant cells for the malignant cell analysis:

## Tumors with >50 malignant cells
filteredTumors2 <- pdat[mal ==2 & tumor %in% filteredTumors1, .N, by = tumor] %>%
  filter(., N > 50) %>%
  select(., tumor) %>%
  pull(., tumor)

These filters were used to subset the expression data matrix into nonMalignant and malignant matricies, respectively.

## Define malignant and non-Malignant Cells ####
nonMalignant <- edat[,pdat[tumor %in% filteredTumors1 & mal == 1, cellID]]
malignant <- edat[,pdat[tumor %in% filteredTumors2 & mal == 2, cellID]]

After subsetting the data, we ran the seurat pipeline which involves the following steps:

  1. Create Seurat Object
  2. Normalize data with LogNormalize function
  3. Find and use only the highly Variable Genes
  4. Scale data to regress out confounders such as the detected molecules per cell.
  5. Perform dimensionality reduction (PCA)
  6. Find clusters and perform t-SNE

These steps can be condensed into a single function:

## For Convenience these steps can be condensed into a single function ####
run_seurat <- function(exprsMatrix){
  ## Create Seurat Object ####
  sobj <- CreateSeuratObject(raw.data = exprsMatrix)
  ## Normalization ####
  sobj <- NormalizeData(sobj, normalization.method = 'LogNormalize', scale.factor = 10000)
  ## Finding Highly Variable Genes ###
  sobj <- FindVariableGenes(sobj,
                            mean.function = ExpMean,
                            dispersion.function = LogVMR,
                            x.low.cutoff = 0.0125,
                            x.high.cutoff = 3,
                            y.cutoff = 0.5)
  ## Scaling data to regress out confounders (detected molecules per cell) ####
  sobj <- ScaleData(sobj, vars.to.regress = c('nUMI'))
  ## Dimensionality Reduction ####
  sobj <- RunPCA(sobj, pc.genes = sobj@var.genes, do.print = T, pcs.print = 1:5, genes.print = 5)
  ## Clustering and tSNE Plot ####
  sobj <- FindClusters(sobj, reduction.type = "pca", dims.use = 1:8, resolution = 1.0, print.output = 0, save.SNN = T)
  sobj <- RunTSNE(sobj, dims.use = 1:8)
  TSNEPlot(sobj)
  return(sobj)
}

Below are expression QC plots for malignant and non-malignant cells, respectively, that visualize and plot both gene and expression counts for all cells.

Malignant:
Non-malignant:

t-SNE Results

We ran the seurat pipeline on the malignant, non-malignant, and all cells combined using the function with the parameters shown above. The resulting t-SNE embeddings were extracted along with the phenotypic data (i.e. tumor and cell_type). The results were plotted and colored according to all different combinations of the phenotypic data:

## Run Seurat Pipeline ####
sobjM <- run_seurat(malignant)
sobjN <- run_seurat(nonMalignant)
sobjC <- run_seurat(edat)

## Extract tSNE Data for Malignant Cells ####
mtSNE <- data.table(
  cellID = rownames(sobjM@dr$tsne@cell.embeddings),
  tSNE_1 = sobjM@dr$tsne@cell.embeddings[,1],
  tSNE_2 = sobjM@dr$tsne@cell.embeddings[,2],
  tumor = factor(pdat$tumor[pdat$cellID %in% colnames(malignant)]),
  cell_type = factor(pdat$cell_type[pdat$cellID %in% colnames(malignant)])
)

## Extract tSNE Data for non-Malignant Cells ####
ntSNE <- data.table(
  cellID = rownames(sobjN@dr$tsne@cell.embeddings),
  tSNE_1 = sobjN@dr$tsne@cell.embeddings[,1],
  tSNE_2 = sobjN@dr$tsne@cell.embeddings[,2],
  tumor = factor(pdat$tumor[pdat$cellID %in% colnames(nonMalignant)]),
  cell_type = factor(pdat$cell_type[pdat$cellID %in% colnames(nonMalignant)])
)

## Extract tSNE Data for All Cells ####
ctSNE <- data.table(
  cellID = rownames(sobjC@dr$tsne@cell.embeddings),
  tSNE_1 = sobjC@dr$tsne@cell.embeddings[,1],
  tSNE_2 = sobjC@dr$tsne@cell.embeddings[,2],
  tumor = factor(pdat$tumor[pdat$cellID %in% colnames(edat)]),
  cell_type = factor(pdat$cell_type[pdat$cellID %in% colnames(edat)])
)
All Cells Colored by tumor (left) and cell type (right)
Malignant Cells (left) Non-malignant Cells (right) colored by tumor
Malignant Cells (left) Non-malignant Cells (right) colored by cell type
Malignant Cells colored by tumor (left) and Non-malignant Cells colored by cell type (right)
Comparison to paper:

Identifying Clusters

To identify the cluster groups for the non-malignant cells we used Density-Based Spatial Clustering and Application with Noise (DBSCAN) on the t-SNE visualization as in Tirosh et al. The plot below shows the t-SNE before and after coloring by the clusters found with DBSCAN. In Tirosh et al. they used DBSCAN parameters eps = 6 and MinPts = 10, but we were unable to recover the same clusters using these parameters, but eps = 2 and MinPts = 10 recovered many of the clusters.

## Identifying Clusters with DBScan ####
kNNdistplot(sobjN@dr$tsne@cell.embeddings, k=10)
abline(h=2)
db <- fpc::dbscan(sobjN@dr$tsne@cell.embeddings, eps = 2, MinPts = 10)

Identifying Clusters

In Tirosh et al., the significant marker genes were identified to determine cell identity of each cluster. We took the genes for each cell type defined in the supplement and created gene signatures to identify the same clusters in our t-SNE plots. The mean expression value of each signature for each cell was computed and used to identify each cluster as shown below:

## Compile Gene Signature Markers and Make Feature Plots ####
markers <- list(
  tcells=c('CD2','CD3D','CD3E','CD3G','CD8A','SIRPG','TIGIT','GZMK','ITK','SH2D1A','CD247','PRF1','NKG7',
           'IL2RB','SH2D2A','KLRK1','ZAP70','CD7','CST7','LAT','PYHIN1','SLA2','STAT4','CD6','CCL5','CD96',
           'TC2N','FYN','LCK','TCF7','TOX','IL32','SPOCK2','SKAP1','CD28','CBLB','APOBEC3G','PRDM1'),
  bcells=c("CD19", "CD79A", "CD79B", "BLK", "MS4A1", "BANK1", "IGLL3P", 
           "FCRL1", "PAX5", "CLEC17A", "CD22", "BCL11A", "VPREB3", "HLA-DOB", 
           "STAP1", "FAM129C", "TLR10", "RALGPS2", "AFF3", "POU2AF1", "CXCR5", 
           "PLCG2", "HVCN1", "CCR6", "P2RX5", "BLNK", "KIAA0226L", "POU2F2", 
           "IRF8", "FCRLA", "CD37"),
  macro=c("CD163", "CD14", "CSF1R", "C1QC", "VSIG4", "C1QA", "FCER1G", 
          "F13A1", "TYROBP", "MSR1", "C1QB", "MS4A4A", "FPR1", "S100A9", 
          "IGSF6", "LILRB4", "FPR3", "SIGLEC1", "LILRA1", "LYZ", "HK3", 
          "SLC11A1", "CSF3R", "CD300E", "PILRA", "FCGR3A", "AIF1", "SIGLEC9", 
          "FCGR1C", "OLR1", "TLR2", "LILRB2", "C5AR1", "FCGR1A", "MS4A6A", 
          "C3AR1", "HCK", "IL4I1", "LST1", "LILRA5", "CSTA", "IFI30", "CD68", 
          "TBXAS1", "FCGR1B", "LILRA6", "CXCL16", "NCF2", "RAB20", "MS4A7", 
          "NLRP3", "LRRC25", "ADAP2", "SPP1", "CCR1", "TNFSF13", "RASSF4", 
          "SERPINA1", "MAFB", "IL18", "FGL2", "SIRPB1", "CLEC4A", "MNDA", 
          "FCGR2A", "CLEC7A", "SLAMF8", "SLC7A7", "ITGAX", "BCL2A1", "PLAUR", 
          "SLCO2B1", "PLBD1", "APOC1", "RNF144B", "SLC31A2", "PTAFR", "NINJ1", 
          "ITGAM", "CPVL", "PLIN2", "C1orf162", "FTL", "LIPA", "CD86", 
          "GLUL", "FGR", "GK", "TYMP", "GPX1", "NPL", "ACSL1"),
  endo=c("PECAM1", "VWF", "CDH5", "CLDN5", "PLVAP", "ECSCR", "SLCO2A1", 
         "CCL14", "MMRN1", "MYCT1", "KDR", "TM4SF18", "TIE1", "ERG", "FABP4", 
         "SDPR", "HYAL2", "FLT4", "EGFL7", "ESAM", "CXorf36", "TEK", "TSPAN18", 
         "EMCN", "MMRN2", "ELTD1", "PDE2A", "NOS3", "ROBO4", "APOLD1", 
         "PTPRB", "RHOJ", "RAMP2", "GPR116", "F2RL3", "JUP", "CCBP2", 
         "GPR146", "RGS16", "TSPAN7", "RAMP3", "PLA2G4C", "TGM2", "LDB2", 
         "PRCP", "ID1", "SMAD1", "AFAP1L1", "ELK3", "ANGPT2", "LYVE1", 
         "ARHGAP29", "IL3RA", "ADCY4", "TFPI", "TNFAIP1", "SYT15", "DYSF", 
         "PODXL", "SEMA3A", "DOCK9", "F8", "NPDC1", "TSPAN15", "CD34", 
         "THBD", "ITGB4", "RASA4", "COL4A1", "ECE1", "GFOD2", "EFNA1", 
         "PVRL2", "GNG11", "HERC2P2", "MALL", "HERC2P9", "PPM1F", "PKP4", 
         "LIMS3", "CD9", "RAI14", "ZNF521", "RGL2", "HSPG2", "TGFBR2", 
         "RBP1", "FXYD6", "MATN2", "S1PR1", "PIEZO1", "PDGFA", "ADAM15", 
         "HAPLN3", "APP"),
  caf=c("FAP", "THY1", "DCN", "COL1A1", "COL1A2", "COL6A1", "COL6A2", 
        "COL6A3", "CXCL14", "LUM", "COL3A1", "DPT", "ISLR", "PODN", "CD248", 
        "FGF7", "MXRA8", "PDGFRL", "COL14A1", "MFAP5", "MEG3", "SULF1", 
        "AOX1", "SVEP1", "LPAR1", "PDGFRB", "TAGLN", "IGFBP6", "FBLN1", 
        "CA12", "SPOCK1", "TPM2", "THBS2", "FBLN5", "TMEM119", "ADAM33", 
        "PRRX1", "PCOLCE", "IGF2", "GFPT2", "PDGFRA", "CRISPLD2", "CPE", 
        "F3", "MFAP4", "C1S", "PTGIS", "LOX", "CYP1B1", "CLDN11", "SERPINF1", 
        "OLFML3", "COL5A2", "ACTA2", "MSC", "VASN", "ABI3BP", "C1R", 
        "ANTXR1", "MGST1", "C3", "PALLD", "FBN1", "CPXM1", "CYBRD1", 
        "IGFBP5", "PRELP", "PAPSS2", "MMP2", "CKAP4", "CCDC80", "ADAMTS2", 
        "TPM1", "PCSK5", "ELN", "CXCL12", "OLFML2B", "PLAC9", "RCN3", 
        "LTBP2", "NID2", "SCARA3", "AMOTL2", "TPST1", "MIR100HG", "CTGF", 
        "RARRES2", "FHL2"),
  nk=c("KLRB1", "KLRC1", "KLRD1", "KLRF1"),
  cd8T=c("CD8A", "CD8B")
)

temp <- nonMalignant
tcell_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$tcells),x] %>% mean)
bcell_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$bcells),x] %>% mean)
macro_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$macro),x] %>% mean)
endo_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$endo),x] %>% mean)
caf_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$caf),x] %>% mean)
nk_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$nk),x] %>% mean)
cd8T_sig <- sapply(1:ncol(temp), function(x) temp[which(fdat$genes %in% markers$cd8T),x] %>% mean)
temp <- rbind(temp, tcell_sig, bcell_sig, macro_sig, endo_sig, caf_sig, nk_sig, cd8T_sig)
rownames(temp)[nrow(temp)]

## Run tSNE Pipeline ####
temp_obj <- run_seurat(temp)

## Feature Plots of 'Cell Markers' ####
p <- FeaturePlot(temp_obj, c("tcell_sig", "bcell_sig", "macro_sig", "endo_sig", "caf_sig", "nk_sig", "cd8T_sig"),
                 cols.use = c("lightgrey", "blue"), nCol = 4, no.legend = F, do.return = T)

UMAP

Uniform Manifold Approximation and Project (UMAP) is a novel manifold learning technique for dimension reduction competitive with t-SNE for data visualization. Before running UMAP, linear dimensional reduction (principal components analysis) was performed. Then UMAP was run on both the untransformed complete data, and on the centered data containing only the 300 and 3000 most highly variable genes. In the UMAP plots below, it can be seen that the malignant cells cluster by tumor origin. In all three cases, distinct clustering can be seen, although in varying degrees. Arguably, t-SNE performs the best in terms of creating distinct separation between clusters. In the t-SNE plot, the non-malignant cells appeared to cluster by cell-type. This is somewhat the case for UMAP run on the centered data using the top 300 most variable genes. However, the T cells form two separate clusters, while some of the other cell-types are lost within the T cells. For UMAP run on the complete data and on the top 3000 most variable genes, one does not observe clustering by cell-type. Clustering by tumor is not apparent either.

SIMLR

Single-cell Interpretaion via Multi-kernal LeaRning (SIMLR) is a similarity-learning frameowrk that learns an appropriate distance metric from the data for purposes of dimension reduction, clustering, and visualization. SIMLR and SIMLR large scale was run on the transformed data using only the 300 most variable genes. The normalized mutual information (NMI) was computed between the clusters inferred by SIMLR large scale and the ground truth clusters. NMI scores range from 0 and 1, with higher values reflecting better performance. Again, in the t-SNE plots, malignant cells cluster by tumor origin. The same is true for clusters identified by SIMLR. The clusters found by SIMLR (NMI=0.6077) are well-separated but a few tumors were clustered together, as can be seen below. SIMLR large scale performed much better than SIMLR (NMI=0.8518) in clustering the malignant cells. SIMLR had the worst performance in clustering the non-malignant cells (NMI=0.5708). As in UMAP, SIMLR created two clusters for T cells while other cell-types were lost within the T cells. There is some clustering by cell-type, but it is not as apparent as in the t-SNE plots, and clustering by tumor is not apparent either.

References